-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor KNNCodec to use new extension point #319
Refactor KNNCodec to use new extension point #319
Conversation
Initial commit for plugin to return CodecServiceFactory as opposed to CodecService. This will allow the plugin to make decisions based on Mapper Service. Signed-off-by: John Mazanec <[email protected]>
Refactors the KNN87Codec to implement FilterCodec. This allows the codec to automatically/flexibly delegate operations it does not override to an arbitrary Codec. Additionally cleans up some code around the Codec Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Adds unit tests that map to each codec component. Did not add tests for merging and codec utils. This can be undertaken later. Adds a utils folder for sharing testing functionality between codec tests. Cleans up a few minor details around codec source code. Signed-off-by: John Mazanec <[email protected]>
|
||
if (field.attributes().containsKey(MODEL_ID)) { | ||
if (field.attributes().containsKey(MODEL_ID)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need null check for attributes before calling contains?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think so. Sole constructor requires non-null: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L103
*/ | ||
@Override | ||
public Codec codec(String name) { | ||
return new KNN87Codec(super.codec(name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some sort of factory here that will allow to select exact k-NN Codec implementation. I'm thinking of a use case for 2.0/lucene 9, we'll need to use Lucene90 codec for it, and probably will create new kNN wrapper class, something like KNN90Codec. Although it might be too much for this PR, we can do it in next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not sure of a case where we will want to build a Codec for an older version. For instance, when we upgrade to Lucene90, I dont think we will want to allow the option to build KNN87Codec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see, so essentially we change it with every next Lucene version but keep older codecs for bwc. Seems that this method (codec(String name)) is the factory method itself. Do we have our abstraction for knn codec apart from lucene abstract Codec class? It may give us more flexibility with testing, although I'm not sure it's that easy to pass our interface to the OpenSearch through plugin related service classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so we keep older Codecs so that older indices can be read by old Codecs via SPI.
BWC is also why we need to version our Codecs (KNN87Codec). This is so that older indices can be read with old codecs.
I think this is something that could be investigated more in a future review.
org.opensearch.knn.index.codec.KNN80Codec.KNN80Codec | ||
org.opensearch.knn.index.codec.KNN84Codec.KNN84Codec | ||
org.opensearch.knn.index.codec.KNN86Codec.KNN86Codec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we do not use these codecs in any older opendistro version? If these codecs were used to create segments in some older version then assuming they upgrade to latest OpenSearch version, it might fail as they do not find respective codec for reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I think this could have problems for SPI.
Let me add back these Codecs back.
11085ec
to
8e99dd4
Compare
Signed-off-by: John Mazanec <[email protected]>
8e99dd4
to
0845b3d
Compare
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
df7ae05
to
7f5bf23
Compare
Signed-off-by: John Mazanec <[email protected]>
@martin-gaievski @vamshin Failure related to spotless check. Please review again - Ill fix spotless after I get approval |
// If the delegate is an instanceof Lucene87, we can get a field specific doc values format | ||
// delegate | ||
if (delegate instanceof Lucene87Codec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we not doing this for every code >= Lucene87? Wondering why this should be only specific to Lucene87
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so Codec does not have the method getDocValuesFormatForField
. Lucene8XCodec's do, however. So, I felt like it made sense to map this to the LuceneCodec version our KNN codec version maps to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can pull this to the separate DocValuesFormat class and inject to knn codec during construction time. We can also make getDocValuesFormatForField part of our codec facade abstraction, this should help to decrease coupling between our and lucene classes. These ideas are more for next refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martin-gaievski what would this look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually was able to remove this code. The reason beingthat the delegate could have the per field functionality and we dont need to do anything to get it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. My initial idea was to create a facade interface for knn codec and then inject it to KNNCodecService using DI similar to spring or guice. I may try to play with it while working on integration with opensearch 2.0/lucene 9.1, we need a codec based on lucene 9.1
Signed-off-by: John Mazanec <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Signed-off-by: John Mazanec <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Codecov Report
@@ Coverage Diff @@
## main #319 +/- ##
============================================
+ Coverage 83.49% 83.74% +0.24%
+ Complexity 889 885 -4
============================================
Files 127 126 -1
Lines 3829 3801 -28
Branches 361 360 -1
============================================
- Hits 3197 3183 -14
+ Misses 470 457 -13
+ Partials 162 161 -1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks
Refactor plugin to return CodecServiceFactory as opposed to CodecService. This will allow the plugin to make decisions based on Mapper Service. Refactors the KNN87Codec to implement FilterCodec. This allows the codec to automatically/flexibly delegate operations it does not override to an arbitrary Codec. Additionally cleans up some code around the Codec Adds unit tests that map to each codec component. Did not add tests for merging and codec utils. This can be undertaken later. Adds a utils folder for sharing testing functionality between codec tests. Cleans up a few minor details around codec source code. Signed-off-by: John Mazanec <[email protected]>
Description
This is a fairly large refactor of our current KNNCodec system.
First, it deletes unused code. For instance, we had KNN84Codec and KNN86Codec that we were not using. The reason to keep them had been potential assistance with backwards compatibility. However, given that we do not use them now, I think it makes sense to just delete.
Second, the plugin is refactored to implement getCustomCodecServiceFactory. This is a new extension point in OpenSearch and will allow multiple EnginePlugins to be used on the same index. For instance, the CCR plugin and k-NN. This allows us to delete the EngineFactory code. Along with this, I moved KNNCodecService to the codec folder for consistency.
Third, the KNN87Codec was refactored to implement the FilterCodec. The KNN87Codec uses the delegate pattern to override the DocValuesFormat. The rest of the formats are delegated to a delegate. This is what the FilterCodec is intended to be used for. Along with this, I extended the delegate reach further so that the KNN87Codec could implement the pattern to its fullest potential.
Fourth, I added several unit tests for the individual components of the KNN87Codec as well as a util class. This will allow us to cover to a fuller extent the test coverage around these components. I have not yet added unit test coverage for merging or for the KNNCodecUtil. I will do this in the future.
Note - I expect checkstyle to fail. My plan is to run
./gradlew spotlessApply
after I get review so as to not distract reviewers from the substantive changes.Issues Resolved
#146 #147
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.